fix: centralize UTF-8 file I/O for Windows compatibility #45
Conversation
Bare `open()` calls use the system encoding (cp1252 on Windows), causing `'charmap' codec can't decode byte ...` errors when parsing repositories containing non-ASCII characters such as curly quotes.

Adds `utilities/file_io.py` with `open_utf8`, `read_json`, `write_json`, and `run_utf8` helpers, and migrates ~190 bare `open()` call sites across `libs/openant-core/` (core, parsers, utilities, openant CLI, top-level scripts) to specify `encoding="utf-8"` explicitly. Also sets `encoding`/`errors` on the docker_executor `subprocess.run` that captures container stdout/stderr as text.

Includes a regression test that scans non-test code for any bare `open()` call without an `encoding=` argument and fails if a regression reappears.

Addresses item 9 from #16.
… UTF-8 Round 1 review fixes for PR knostic#45:

- application_context.py, ast_parser.py, dataset_enhancer.py, report/__main__.py, report/generator.py: pass `encoding='utf-8'` on every `Path.read_text()` / `write_text()` call. The previous migration only covered `open()` calls; pathlib's text helpers also default to the system locale on Windows (cp1252) and crash on non-ASCII source code.
- parsers/{c,go,javascript,php,ruby}/test_pipeline.py: pass `encoding='utf-8', errors='replace'` on `subprocess.run(text=True)` invocations of parser binaries and CodeQL. Only docker_executor.py was migrated before; these other call sites had the same Windows cp1252 hazard.
- tests/test_file_io.py: extend the regression scan with two new asserts: `Path.read_text`/`write_text` without `encoding=`, and `subprocess.run(text=True)` without `encoding=`. Refactored the call-walking logic into a shared helper.

All 14 file_io tests pass; full tests/ suite: 98 passed, 22 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
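The pathlib hazard called out above is easy to reproduce. A minimal round-trip that stays safe on any locale (the file name and payload are illustrative):

```python
import tempfile
from pathlib import Path

p = Path(tempfile.mkdtemp()) / "sample.py"
# Without encoding=, write_text/read_text fall back to
# locale.getpreferredencoding() -- cp1252 on Windows -- and a curly
# quote like U+2019 raises a Unicode error there.
p.write_text('quote = "\u2019"\n', encoding="utf-8")
assert "\u2019" in p.read_text(encoding="utf-8")
```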
Manual verification
Local test results: verified the migration on Windows by static-grepping the worktree and round-tripping non-ASCII content through the new helpers.
ar7casper
left a comment
One thing I'd like your call on before merging: utilities/file_io.py defines open_utf8 / read_json / write_json / run_utf8, but the migration inlines `encoding="utf-8"` at every call site rather than routing through these helpers. So as it stands, the helper module is only imported by its own test file; production code never touches it.

Three ways to go:

- Adopt the helpers in this PR: a single chokepoint to change later (e.g. if we ever want to switch to `errors="strict"` or add logging). Bigger churn though, and it's literally one keyword argument we'd be hiding.
- Keep the helpers as the recommended pattern for new code (the current state). Fine, but they'll quietly rot unless we tell contributors to reach for them. Worth a one-liner in libs/openant-core/CLAUDE.md if you go this way.
- Drop the helpers and rely on inlined `encoding="utf-8"` plus the regression scanners as the contract. Less surface area, one less thing to import.
The previous commit inlined encoding="utf-8" at every call site, but production code already used the file_io helpers (read_json, write_json, open_utf8) on master. This restores that pattern so the helpers are actually exercised and remain the single chokepoint for I/O policy. All four regression scanner tests (bare open, pathlib text I/O, dot open, text-mode subprocess) continue to pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng failure

Bare open() without encoding= uses cp1252 on Windows, which would crash on non-ASCII dataset content. These calls used read_json on master.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three files that used helpers on master still had inline encoding after the previous fix pass: application_context.py (save_context write), parser_adapter.py (diff_filter report write), and agentic_enhancer/repository_index.py (load_index_from_file read). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace inline open/json.load and open/json.dump patterns with
read_json/write_json from utilities/file_io.py in the ten files
added by this branch that were still using bare I/O:
- core/diff_filter.py
- core/schemas.py (StepReport.write)
- openant/cli.py (all json.load calls in cmd_*)
- report/__main__.py (cmd_summary, cmd_disclosures)
- report/generator.py (merge_dynamic_results, generate_all)
- utilities/context_enhancer.py (checkpoint read/write paths)
- parsers/zig/{call_graph_builder,function_extractor,repository_scanner,unit_generator}.py
All 15 regression-scanner tests pass after this change.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace filepath.read_text + json.loads with read_json in check_for_manual_override, keeping read_text only for the .md parsing branch which needs the raw text for regex. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- reporter.py: open_utf8 for markdown report writes
- docker_executor.py: open_utf8 for Dockerfile/compose/script writes
- parsers/*/test_pipeline.py: read_json/write_json for all JSON reads and writes; open_utf8 for text file writes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- subprocess.run(..., encoding="utf-8", errors="replace") → run_utf8(...) in all test_pipeline.py files and docker_executor.py
- Path.write_text(encoding="utf-8") → open_utf8 + f.write() in report/__main__.py and report/generator.py
- Path.read_text(encoding="utf-8") → open_utf8 + f.read() in report/generator.py (load_prompt)

Remaining read_text(errors="ignore/replace") calls in parsers and application_context.py stay inline; open_utf8 does not support the errors= override, and these reads are intentionally lenient on non-UTF-8 source files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
open_utf8 passes **kwargs through to open(), so errors="ignore" and errors="replace" work correctly. Replace all remaining Path.read_text inline uses in application_context.py, ast_parser.py, and dataset_enhancer.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
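The pass-through behavior can be checked against a file containing a byte that is invalid UTF-8. The `open_utf8` body below is a sketch of the helper for illustration, not the PR's exact code:

```python
import os
import tempfile


def open_utf8(path, mode="r", **kwargs):
    # Sketch of the PR helper: UTF-8 forced, extra kwargs forwarded to open().
    return open(path, mode, encoding="utf-8", **kwargs)


path = os.path.join(tempfile.mkdtemp(), "latin1.txt")
with open(path, "wb") as f:
    f.write(b"caf\xe9\n")  # 0xE9 is Latin-1, not valid UTF-8

# Because **kwargs reaches open(), lenient decoding still works:
with open_utf8(path, errors="replace") as f:
    text = f.read()
assert text == "caf\ufffd\n"  # invalid byte replaced with U+FFFD
```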
joshbouncesecurity
left a comment
Went with option 1. All production code (and the test pipelines) now routes through the helpers — read_json, write_json, open_utf8, and run_utf8 — rather than inlining encoding="utf-8" at every call site. The four regression scanner tests in test_file_io.py enforce this going forward and will catch any backslide on bare open(), Path.read_text/write_text, or subprocess.run(text=True) without the helpers.
The only remaining encoding="utf-8" references in production code are calls that also pass errors="ignore" or errors="replace" to read arbitrary source files — open_utf8 passes **kwargs through to open(), so those use open_utf8(path, errors="replace") etc. rather than read_text.
@ar7casper I think this is ok now
Adds test_no_bare_path_calls_in_typescript_analyzer, a cross-platform scanner that greps typescript_analyzer.js for any path.relative/resolve/join() call not accompanied by toPosixPath() within a ±6-line window. This mirrors the PR knostic#45 pattern (test_no_bare_open, test_no_bare_pathlib_text_io, etc.) that prevents contributors from reintroducing encoding antipatterns.

Scoped to typescript_analyzer.js only; other JS parser files legitimately call path.X() without toPosixPath() since they do not interact with ts-morph.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
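A windowed scan like the one described could be sketched as follows. The function name, regex, and return shape are illustrative, not the PR's actual test code:

```python
import re

# Matches path.relative(...), path.resolve(...), path.join(...) in JS source.
PATH_CALL = re.compile(r"\bpath\.(relative|resolve|join)\s*\(")


def find_bare_path_calls(source: str, window: int = 6) -> list[int]:
    """Return 1-based line numbers of path.X() calls with no toPosixPath()
    anywhere within `window` lines above or below."""
    lines = source.splitlines()
    bare = []
    for i, line in enumerate(lines):
        if not PATH_CALL.search(line):
            continue
        lo, hi = max(0, i - window), min(len(lines), i + window + 1)
        if not any("toPosixPath(" in nearby for nearby in lines[lo:hi]):
            bare.append(i + 1)
    return bare
```

A test would read typescript_analyzer.js, call this, and assert the list is empty, so the failure message can name the offending lines.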
Summary
Bare `open()` calls use the system encoding (cp1252 on Windows), causing `charmap codec can't decode` errors on any target codebase containing non-ASCII characters (e.g. curly quotes U+2019, accented characters, CJK).

Adds `utilities/file_io.py` with centralized UTF-8 helpers (`open_utf8`, `read_json`, `write_json`, `run_utf8`) and migrates ~190 bare `open()` call sites across `libs/openant-core/` (core, parsers, utilities, openant CLI, top-level scripts) to specify `encoding="utf-8"` explicitly. Also sets `encoding`/`errors` on the docker_executor `subprocess.run` that captures container stdout/stderr as text.

A regression test scans non-test Python files for any bare `open()` call without an `encoding=` argument and fails if a regression reappears.

Addresses item 9 from #16 (does not close the issue).
Test plan
- Round-trip non-ASCII content through `open_utf8`, `read_json`, `write_json`.
- `run_utf8` captures non-ASCII subprocess output (incl. the `universal_newlines=True` alias).
- `run_utf8` only injects `encoding`/`errors` when the caller asks for text mode; explicit `encoding=` is respected.
- Regression test finds no bare `open(` in non-test code (the test scans `libs/openant-core/` excluding `tests/` and the helper itself).
- `pytest tests/` -> 96 passed, 22 skipped (env-dependent).
- No charmap errors.
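A bare-`open()` scan of the kind described can be built on the `ast` module. This is a simplified sketch: a real scanner would also walk the files on disk, skip binary-mode opens, and cover `Path.read_text`/`write_text` and `subprocess.run` as later commits describe.

```python
import ast


def bare_open_calls(source: str, filename: str = "<scan>") -> list[int]:
    """Return line numbers of open() calls that lack an encoding= keyword."""
    hits = []
    for node in ast.walk(ast.parse(source, filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Name) and func.id == "open"):
            continue
        if "encoding" not in {kw.arg for kw in node.keywords}:
            hits.append(node.lineno)
    return hits
```

Working on the AST rather than raw text means comments, strings, and line-wrapped calls cannot produce false positives or negatives.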